November 15, 2020
On 32 bits, we divide the bits into 1 sign bit $s$, 8 exponent bits $e$, and the remaining 23 bits for the fractional part.
The formula for decoding a 32-bit floating point number is as follows:
$$n_{10} = (-1)^s \times 2^e \times \left( 1 + \sum_{i=1}^{23} b_{23-i} \times 2^{-i} \right)$$
where $n_{10}$ is the resulting decimal number, $s$ is the sign bit (most significant bit), $e$ is the decimal value encoded by the 8 exponent bits, and $b_i$ is bit number $i$ of the fractional part (bit 0 being the least significant).
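The formula can be checked with a short Python sketch. The function name `decode_float32` is hypothetical, and the sketch assumes a normalized number with the exponent stored with an offset of 127; it rejects the all-zeros and all-ones exponent fields rather than handling them.

```python
import struct

def decode_float32(bits: int) -> float:
    """Decode a 32-bit pattern with the formula above (normalized numbers only)."""
    s = (bits >> 31) & 0x1        # sign bit (bit 31)
    stored = (bits >> 23) & 0xFF  # 8 exponent bits as stored (bits 30 to 23)
    fraction = bits & 0x7FFFFF    # 23 fraction bits (bits 22 to 0)
    if stored in (0, 255):
        # Assumption of this sketch: reserved exponent fields are not decoded.
        raise ValueError("reserved exponent field (zero, subnormal, infinity or NaN)")
    e = stored - 127              # remove the offset to get the actual exponent
    significand = 1.0             # the leading 1 of the formula
    for i in range(1, 24):
        b = (fraction >> (23 - i)) & 1  # b_{23-i}
        significand += b * 2.0 ** (-i)
    return (-1) ** s * 2.0 ** e * significand

bits = 0x3FB00000  # 0 01111111 01100000000000000000000
# Cross-check against Python's own IEEE 754 decoding.
assert decode_float32(bits) == struct.unpack(">f", struct.pack(">I", bits))[0]
print(decode_float32(bits))  # 1.375
```

Here the fraction bits 22 and 21 are set, so the sum is 0.25 + 0.125 and the result is 1.375 × 2^0.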
The most significant bit (bit 31) is the sign bit. 0 means we encoded a positive number, and 1 is negative.
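The exponent $e$ is not encoded using the two's complement representation, but with a different one: the offset-binary representation with the zero offset being 127. The stored 8-bit value is the exponent plus 127, so $0000\,0001_2$ represents $-126$, $0111\,1111_2$ represents $0$ and $1111\,1110_2$ represents $127$. The two remaining patterns are reserved: $0000\,0000_2$ encodes zero and subnormal numbers, and $1111\,1111_2$ encodes infinities and NaN.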
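To see the offset in practice, the stored exponent field of a few values can be printed next to the exponent it stands for. This is a minimal sketch using Python's standard `struct` module; the helper name `exponent_field` is hypothetical.

```python
import struct

BIAS = 127  # the zero offset of the exponent field

def exponent_field(x: float):
    """Return the stored 8 exponent bits of x as a string, and the exponent they encode."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))  # x as a 32-bit pattern
    stored = (bits >> 23) & 0xFF
    return format(stored, "08b"), stored - BIAS

for x in (2.0 ** -126, 0.5, 1.0, 2.0, 2.0 ** 127):
    field, e = exponent_field(x)
    print(f"{x!r:>24}  stored {field}  exponent {e}")
```

Subtracting 127 from the stored field gives the exponent used in the decoding formula: 1.0 is stored with the field `01111111`, and the smallest and largest normalized powers of two use `00000001` and `11111110`.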
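The fractional part of the number is encoded with standard binary encoding. There is a simple method to convert a decimal fractional part into binary: multiply it by 2, take the integer part of the result (0 or 1) as the next bit, then repeat with the remaining fractional part until all 23 bits are filled.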
For example, for 0.345:
Multiply by 2 | Integer part | Fraction part | Bit number in 32-bit representation |
---|---|---|---|
0.345 * 2 = 0.690 | 0 | 0.690 | 22 |
0.690 * 2 = 1.380 | 1 | 0.380 | 21 |
0.380 * 2 = 0.760 | 0 | 0.760 | 20 |
0.760 * 2 = 1.520 | 1 | 0.520 | 19 |
0.520 * 2 = 1.040 | 1 | 0.040 | 18 |
0.040 * 2 = 0.080 | 0 | 0.080 | 17 |
.. | .. | .. | .. |
0.880 * 2 = 1.760 | 1 | 0.760 | 0 |
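The table can be reproduced with a few lines of Python. This is a sketch rather than a full encoder: the function name `fraction_to_bits` is hypothetical, it uses `fractions.Fraction` so that 0.345 is handled exactly, and it truncates after 23 bits instead of rounding to the nearest value.

```python
from fractions import Fraction

def fraction_to_bits(value: Fraction, n_bits: int = 23) -> str:
    """Convert a fractional part in [0, 1) to n_bits binary digits by repeated doubling."""
    bits = []
    frac = value
    for _ in range(n_bits):
        frac *= 2
        integer_part = int(frac)  # 0 or 1: the next bit, most significant first
        bits.append(str(integer_part))
        frac -= integer_part      # keep only the fraction part for the next step
    return "".join(bits)

print(fraction_to_bits(Fraction("0.345")))  # 01011000010100011110101
```

The first character of the result is bit 22 and the last one is bit 0, matching the rows of the table.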
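The fractional part is stored with 23 bits. Together with the implicit leading 1, this allows a precision of between 6 and 9 significant decimal digits ($2^{23} = 8\,388\,608$). The exponent is stored on 8 bits, which allows normalized numbers from $2^{-126} \approx 1.175 \times 10^{-38}$ up to $2^{127} \approx 1.701 \times 10^{38}$ as the largest power of two; with all fraction bits set, the largest finite value is about $3.403 \times 10^{38}$.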
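These limits can be checked from Python by building the extreme bit patterns directly. The sketch below relies only on the standard `struct` module; the helper names `float32_from_bits` and `to_float32` are hypothetical.

```python
import struct

def float32_from_bits(bits: int) -> float:
    """Interpret a 32-bit integer as an IEEE 754 single-precision number."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]

def to_float32(x: float) -> float:
    """Round a Python float (64-bit) to the nearest 32-bit float."""
    return struct.unpack(">f", struct.pack(">f", x))[0]

smallest_normal = float32_from_bits(0x00800000)  # exponent field 0000 0001, fraction 0
largest_power = float32_from_bits(0x7F000000)    # exponent field 1111 1110, fraction 0
largest_finite = float32_from_bits(0x7F7FFFFF)   # same exponent, all fraction bits set

print(smallest_normal == 2.0 ** -126)  # True
print(largest_power == 2.0 ** 127)     # True
print(f"{largest_finite:.4e}")

# Precision: with 24 significant bits, 2**24 + 1 cannot be stored exactly.
print(to_float32(16777217.0))  # 16777216.0
```

The last line shows the precision limit: 16 777 217 has 8 decimal digits, and it is already rounded to its neighbor when stored on 32 bits.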